CoMMEDIA: Separating Scaramouche from Harlequin to Accurately Estimate Items Frequency in Distributed Data Streams

Authors

  • Emmanuelle Anceaume
  • Yann Busnel
Abstract

In this paper, we investigate the problem of estimating the number of times that data items recur in very large distributed data streams. We present an alternative approach to the well-known Count-Min Sketch in order to reduce the impact of collisions on the accuracy of the estimation. We propose to decrease, for each affected item, the over-estimation that results from these collisions. Our sketch, called CoMMEDIETTA, keeps track of the most frequent items of the stream and removes their weight from that of the items with which these frequent items collide. By doing so, we significantly improve upon the Count-Min Sketch, obtaining a randomized (ε, δ)-approximation algorithm. We then propose to judiciously distribute this local sketch to estimate the global frequency of any item that may recur in multiple streams. This distributed sketch, called CoMMEDIA (for Count-Min Sketch-based Estimation of Data Items Arrival frequency), organizes the nodes of the system in a distributed hash table (DHT) such that each node implements a tiny local sketch on a reduced number of items. By doing so, we guarantee a significantly more accurate estimation of item frequencies. Simulations on both synthetic and real traces confirm the accuracy of CoMMEDIA.
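For readers unfamiliar with the baseline, the following is a minimal sketch of a standard Count-Min Sketch, not of the authors' CoMMEDIETTA or CoMMEDIA algorithms. It only illustrates where the over-estimation mentioned in the abstract comes from: distinct items that hash to the same counter add to each other's counts. The class name, the `width` and `depth` parameters, and the use of salted BLAKE2b hashes as row hash functions are assumptions made for this illustration.

```python
import hashlib


class CountMinSketch:
    """Baseline Count-Min Sketch: `depth` rows of `width` counters each.

    Illustrative only; names and hashing scheme are assumptions,
    not taken from the paper.
    """

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by salting the item with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Each occurrence increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum over the rows
        # is the tightest upper bound the sketch can offer.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

With a deliberately small table, for example `CountMinSketch(width=16, depth=3)` fed one very frequent item and many rare ones, the rare items that share counters with the frequent item inherit part of its weight. This is the kind of error the abstract proposes to reduce by tracking the most frequent items separately.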

Similar resources

Separating indexes from data: a distributed scheme for secure database outsourcing

Database outsourcing is a way to relieve organizations of the burden of database management. Since data is a critical asset of an organization, its privacy must be preserved against both outside adversaries and untrusted servers. In this paper, we present a distributed scheme based on storing shares of data on different servers and separating indexes from data on a distinct server. Shamir...

DETERMINING OF REGIONAL COEFFICIENTS OF FULLER'S EMPIRICAL FORMULA TO ESTIMATE MAXIMUM INSTANTANEOUS DISCHARGES IN DASHT KAVIR BASIN, KALSHOUR SABZEVAR, IRAN

Estimates of the magnitude and frequency of maximum instantaneous discharges and hydrographs are used for a variety of purposes, such as the design of bridges, culverts, and flood-control structures, and the management and regulation of floodplains. Fuller (1914) developed a flood-frequency formula based on an analysis of flood peaks in hundreds of streams to provide simple methods of estimating maxi...

Error-Adaptive and Time-Aware Maintenance of Frequency Counts over Data Streams

Maintaining frequency counts for items over data streams has a wide range of applications, such as web advertisement fraud detection. The study of this problem has attracted great attention from both researchers and practitioners, and many algorithms have been proposed. In this paper, we propose a new method, the error-adaptive pruning method, to maintain frequencies more accurately. We also propose a method c...

Finding Frequent Items in Data Streams

We present a 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves better space bounds than the previous best known algorithms for this problem for many natural distributions on ...

Online Mining Changes of Items over Continuous Append-only and Dynamic Data Streams

Online mining changes over data streams has been recognized to be an important task in data mining. Mining changes over data streams is both compelling and challenging. In this paper, we propose a new, single-pass algorithm, called MFC-append (Mining Frequency Changes of append-only data streams), for discovering the frequent frequency-changed items, vibrated frequency changed items, and stable...

Publication date: 2013